Overview

This report provides an evaluation of the accuracy and precision of probabilistic forecasts of COVID-19 cases and deaths submitted to the US COVID-19 Forecast Hub. Some analyses include forecasts submitted starting in April 2020. Others focus on evaluating “recent” forecasts, submitted only in the last 10 weeks.

In collaboration with the US Centers for Disease Control and Prevention (CDC), the COVID-19 Forecast Hub collects short-term COVID-19 forecasts from dozens of research groups around the globe. Every Tuesday morning we combine the most recent forecasts from each team into a single “ensemble” forecast for each forecast target. This ensemble is used as the official forecast of the CDC, typically appearing on their forecasting website on Wednesday.
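
The combination step can be illustrated, very roughly, as a quantile-wise aggregation: at each quantile level and location, the values submitted by the individual models are combined (shown here, for illustration, as a median). The data frame layout, model names, and numbers below are hypothetical, and the production ensemble may screen or weight models differently.

    import pandas as pd

    # Hypothetical submissions: one row per model, location, and quantile level.
    submissions = pd.DataFrame({
        "model":    ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
        "location": ["US"] * 9,
        "quantile": [0.025, 0.500, 0.975] * 3,
        "value":    [800, 1000, 1300, 900, 1100, 1500, 700, 950, 1250],
    })

    # Quantile-wise combination: for each location and quantile level, take the
    # median of the values submitted by the individual models.
    ensemble = (submissions
                .groupby(["location", "quantile"], as_index=False)["value"]
                .median())
    print(ensemble)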

Incident Case Forecasts

Summary Tables

The first table evaluates models based on their adjusted relative weighted interval scores (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 10 weeks and for all historical weeks. To account for the variation in difficulty of forecasting different weeks and locations, a pairwise approach was used to calculate the relative adjusted WIS and MAE. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate less accuracy than the baseline on average.
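
For context, one way to carry out such a pairwise comparison, under the assumption that it compares each pair of models on the forecast units they both covered, takes a geometric mean of those ratios, and rescales by the baseline model's value, is sketched below. The column names and toy scores are hypothetical, and the report's exact adjustment may differ.

    import itertools
    import numpy as np
    import pandas as pd

    # Hypothetical long-format scores: one WIS value per model and forecast unit
    # (a unit might be a location-week-horizon combination).
    scores = pd.DataFrame({
        "model": ["baseline", "baseline", "modelA", "modelA", "modelB"],
        "unit":  ["w1-US",    "w2-US",    "w1-US",  "w2-US",  "w1-US"],
        "wis":   [10.0,        12.0,       8.0,      9.0,      7.0],
    })

    models = scores["model"].unique()
    ratios = {}
    for m1, m2 in itertools.permutations(models, 2):
        # Compare each ordered pair of models only on the units they share.
        shared = scores[scores["model"] == m1].merge(
            scores[scores["model"] == m2], on="unit", suffixes=("_1", "_2"))
        if len(shared):
            ratios[(m1, m2)] = shared["wis_1"].mean() / shared["wis_2"].mean()

    # Relative skill: geometric mean of each model's ratios against the others,
    # rescaled so the baseline model equals 1.
    theta = {m: np.exp(np.mean([np.log(r) for (a, _), r in ratios.items() if a == m]))
             for m in models}
    relative_wis = {m: theta[m] / theta["baseline"] for m in models}
    print(relative_wis)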

The second table evaluates models based on their prediction interval coverage at the 50% and 95% levels. Scores are aggregated separately for the most recent 10 weeks and for all historical weeks.
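
Empirical interval coverage here is simply the share of forecasts whose prediction interval contained the eventually reported value. A minimal sketch for the 95% level, with hypothetical column names and data:

    import pandas as pd

    # Hypothetical forecasts: 95% interval endpoints and the observed truth.
    df = pd.DataFrame({
        "lower_95": [ 90, 200, 310, 150],
        "upper_95": [150, 260, 400, 210],
        "observed": [120, 270, 395, 205],
    })

    # A forecast "covers" when the observation falls inside the interval;
    # a well calibrated 95% interval should cover roughly 95% of the time.
    covered = (df["observed"] >= df["lower_95"]) & (df["observed"] <= df["upper_95"])
    print(f"Empirical 95% coverage: {covered.mean():.0%}")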

Inclusion criteria for each column are detailed below the table.

Accuracy Table

To calculate each column in our table, different inclusion criteria were applied. This table only includes models that have submitted at least 50% of the evaluated forecasts over the last 10 weeks or at least 50% of the evaluated forecasts since the first week in April; a sketch of this submission threshold appears after the list below.

  • The column titled “n recent forecasts” lists the number of forecasts a team has submitted with a target end date in the most recent 10-week period.

  • Columns 3 and 4 show the adjusted relative WIS and the adjusted relative MAE over the most recent 10-week period. For inclusion in these columns, a model must have submitted forecasts for at least 50% of the evaluated forecasts in the most recent evaluation period.

  • Column 5 shows the number of forecasts a team has submitted over the historical period.

  • Columns 6 and 7 show the adjusted relative WIS and adjusted relative MAE over a historical period beginning the first week in March. For inclusion in these columns, a model must have submitted predictions for 50% or more of the evaluated forecasts in the historical evaluation period.
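
As a sketch of the submission threshold referenced above, each model's submitted forecasts can be counted against the number of possible location-week combinations in the evaluation window; the 50% cutoff then determines inclusion. The column names and counts below are hypothetical.

    import pandas as pd

    # Hypothetical submission log: one row per model, location, and target week.
    log = pd.DataFrame({
        "model":    ["A", "A", "A", "B"],
        "location": ["US", "CA", "NY", "US"],
        "week":     ["2021-01-02"] * 4,
    })

    n_possible = 3 * 1   # e.g. 3 evaluated locations x 1 evaluated week
    counts = log.groupby("model").size()
    eligible = counts[counts / n_possible >= 0.5].index.tolist()
    print(eligible)      # models meeting the 50% threshold -> ['A']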

Coverage Table

For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total since the beginning of April, or have submitted forecasts during at least 2 out of the last 3 evaluated weeks. This inclusion criterion was applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts.

WIS components

The data in this graph have been aggregated over all locations and submission weeks. The models included have submitted at least 50% of the evaluated forecasts over the last 10 weeks. This is the same inclusion criterion applied to WIS scores in the recent evaluation period.

The bars for each model sum to its overall WIS. Of note, these values may not exactly match the relative WIS scores shown in the leaderboard table, because they are not adjusted for missing weeks or locations.
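
For reference, a minimal sketch of how such a decomposition can be computed, assuming the components are the dispersion, overprediction, and underprediction terms of the standard interval-score formulation of WIS, with the median penalty split between the over- and under-prediction terms. The function name and interface are hypothetical; the report's exact implementation may differ.

    import numpy as np

    def wis_decomposition(y, median, lower, upper, alphas):
        """Split the weighted interval score for a single observation into
        dispersion, overprediction, and underprediction components.

        lower/upper hold the central (1 - alpha) interval endpoints for each
        alpha level in `alphas`; the three components sum to the total WIS."""
        alphas = np.asarray(alphas, dtype=float)
        lower = np.asarray(lower, dtype=float)
        upper = np.asarray(upper, dtype=float)
        K = len(alphas)

        # Interval widths, weighted by alpha / 2 (the sharpness of the forecast).
        dispersion = np.sum(alphas / 2 * (upper - lower))
        # The (alpha/2) * (2/alpha) interval penalties simplify to plain distances
        # outside each interval; the median term enters with weight 1/2.
        overprediction = np.sum(np.maximum(lower - y, 0)) + 0.5 * max(median - y, 0)
        underprediction = np.sum(np.maximum(y - upper, 0)) + 0.5 * max(y - median, 0)

        scale = K + 0.5
        return {"dispersion": dispersion / scale,
                "overprediction": overprediction / scale,
                "underprediction": underprediction / scale,
                "wis": (dispersion + overprediction + underprediction) / scale}

    # Toy example: 50% and 90% central intervals around a median of 100, truth 130.
    print(wis_decomposition(y=130, median=100,
                            lower=[90, 70], upper=[110, 140], alphas=[0.5, 0.1]))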

Evaluation by Week

In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are for models that have submitted probabilistic forecasts for all 50 states.

The first two figures use WIS as the metric. The first shows the mean WIS across all 50 states at a 1-week horizon for submission weeks beginning the first week in April. The second shows the mean WIS aggregated across the same locations, but at a 4-week horizon.
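
The aggregation behind these figures amounts to averaging WIS over locations for each model, submission week, and horizon. A minimal sketch, with hypothetical column names:

    import pandas as pd

    # Hypothetical per-forecast scores.
    scores = pd.DataFrame({
        "model":    ["A", "A", "B", "B"],
        "week":     ["2021-01-02"] * 4,
        "horizon":  [1, 1, 1, 1],
        "location": ["CA", "NY", "CA", "NY"],
        "wis":      [12.0, 8.0, 15.0, 9.0],
    })

    # Mean WIS across locations: one point per model, week, and horizon.
    weekly = scores.groupby(["model", "week", "horizon"], as_index=False)["wis"].mean()
    print(weekly)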

To view a specific team, double-click the team name in the legend. To view a value on the plot, click the point of interest. To focus on a specific time period, highlight that section of the graph or use the zoom functionality.

1 Week Horizon WIS

4 Week Horizon WIS

In this figure, the dotted black line represents the average 1-week-ahead error. Error is often larger at the 4-week horizon than at the 1-week horizon.

1 Week Horizon 95% Coverage

We would expect a well-calibrated model to have a coverage value of 95% in this plot.

4 Week Horizon 95% Coverage

We would expect a well-calibrated model to have a coverage value of 95% in this plot. Deviations from 95% are typically larger at the 4-week horizon than at the 1-week horizon.

Evaluation by location

The figures below show model performance stratified by location. We have only included models that have submitted forecasts for all 4 horizons and for at least 50% of the past 10 evaluated weeks.

The color scheme shows the WIS relative to the baseline. The only locations evaluated are the 50 states and the national level.

Observed data

This figure shows the number of incident COVID-19 cases reported each week in the US. The period between the vertical blue lines shows the weeks included in the “recent” model evaluations.

Incident Death Forecasts

Summary Tables

The first table below evaluates models based on their adjusted relative weighted interval scores (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 10 weeks and for all historical weeks. To account for the variation in difficulty of forecasting different weeks and locations, a pairwise approach was used to calculate the relative adjusted WIS and MAE. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate less accuracy than the baseline on average.

The second table evaluates models based on their prediction interval coverage at the 50% and 95% levels. Scores are aggregated separately for the most recent 10 weeks and for all historical weeks.

Inclusion criteria for each column are detailed below the table.

Accuracy Table

In this table, we have included all models with an eligible WIS or MAE score.

In order to meet eligibility for the adjusted relative WIS or MAE over the most recent 10-week period, a model must have submitted forecasts for 50% or more of the evaluated forecasts in the most recent evaluation period. WIS was only calculated for teams that submitted all required quantiles.
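
As an illustration of that check, the snippet below assumes the required set for death targets is the 23 quantile levels 0.01, 0.025, 0.05, 0.10, ..., 0.95, 0.975, 0.99 and tests whether a submission includes them all; the function and submission layout are hypothetical.

    # Assumed full set of required quantile levels for death targets (23 levels).
    REQUIRED = [0.01, 0.025] + [round(0.05 * k, 2) for k in range(1, 20)] + [0.975, 0.99]

    def has_all_quantiles(submitted_levels):
        """True when a submission contains every required quantile level."""
        return set(REQUIRED) <= {round(q, 3) for q in submitted_levels}

    print(has_all_quantiles(REQUIRED))              # True: complete submission
    print(has_all_quantiles([0.025, 0.5, 0.975]))   # False: only a subset submitted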

In order to be eligible for the historical calculation of MAE or WIS, a model must have predictions for 50% or more of the evaluated forecasts in the historical evaluation period.

Coverage Table

For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total since the beginning of April, or have submitted forecasts during at least 2 out of the last 3 evaluated weeks. This inclusion criterion was applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts.

WIS components

The data in this graph have been aggregated over all locations and submission weeks. The models included have submitted at least 50% of the evaluated forecasts over the last 10 weeks. This is the same inclusion criterion applied to WIS scores in the recent evaluation period.

The bars for each model sum to its overall WIS. Of note, these values may not exactly match the relative WIS scores shown in the leaderboard table, because they are not adjusted for missing weeks or locations.

Models are ordered on the x-axis by their relative WIS scores from the accuracy table.

Evaluation by Week

In the following figures, we have evaluated models across multiple forecasting weeks. The models included in this comparison must have submitted forecasts for all 50 states and at the national level for each time point.

The first two figures use WIS as the metric. The first shows the mean WIS across all locations for each submission week at a 1-week horizon. The second shows the mean WIS aggregated across locations at a 4-week horizon.

To view a specific team, double-click the team name in the legend. To view a value on the plot, click the point of interest. To focus on a specific time period, highlight that section of the graph or use the zoom functionality.

1 Week Horizon WIS

4 Week Horizon WIS

In this figure, the dotted black line represents the average 1-week-ahead error. There is larger variation in error at the 4-week horizon than at the 1-week horizon.

1 Week Horizon 95% Coverage

The black line represents the nominal 95% coverage level.

4 Week Horizon 95% Coverage

The black line represents the nominal 95% coverage level.

Evaluation by location

The figures below show model performance stratified by location. We only include models that have submitted forecasts for all 4 horizons and for at least 50% of the last 10 evaluated weeks.

The color scheme shows the WIS relative to the baseline. The only locations evaluated are the 50 states and the national level.

Observed data

This plot shows the observed number of incident deaths over time in the US. The period between the vertical blue lines shows the weeks included in the “recent” model evaluations.